NExT-GPT is an open-source multimodal large language model developed at the National University of Singapore. It can process text, images, video, and audio, providing robust support for multimedia AI applications. The system uses a three-stage architecture: multimodal encoders whose outputs are aligned to the language model through linear projection layers, a Vicuna LLM core, and modality-specific output projection layers that condition the generators for each modality. The projection layers connecting these stages are trained with MosIT (Modality-switching Instruction Tuning). The open-source release enables researchers and developers to build applications that integrate multimodal inputs, with potential uses spanning a wide range of fields. What sets NExT-GPT apart is its any-to-any capability: it can produce output in whichever supported modality the user requests.
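The three-stage flow described above can be sketched in miniature. This is a hedged illustration, not NExT-GPT's actual code: the class and function names (`ToyNextGptPipeline`, `matvec`, the dimension sizes) are invented for this example, the encoder and LLM stages are stubs, and only the idea that small projection layers sit between frozen components is taken from the description.

```python
# Illustrative sketch of a NExT-GPT-style three-stage pipeline.
# All names and dimensions here are hypothetical, not the real API.
import random

random.seed(0)

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

class ToyNextGptPipeline:
    """Stage 1: frozen modality encoder -> linear input projection.
    Stage 2: LLM core reasons over the aligned representation.
    Stage 3: modality-specific output projection conditions a generator."""

    def __init__(self, enc_dim=8, llm_dim=16, dec_dim=4):
        # In the real system only these lightweight projection layers are
        # trained; the encoders, the LLM, and the decoders stay frozen.
        self.enc_dim = enc_dim
        self.in_proj = random_matrix(llm_dim, enc_dim)
        self.out_proj = {m: random_matrix(dec_dim, llm_dim)
                         for m in ("image", "video", "audio")}

    def encode(self, raw_input):
        # Stand-in for a frozen multimodal encoder: map the input
        # to a fixed-size feature vector (here, from its raw bytes).
        padded = raw_input.encode().ljust(self.enc_dim, b"\0")
        return [float(b) for b in padded[: self.enc_dim]]

    def forward(self, raw_input, target_modality):
        features = self.encode(raw_input)                 # stage 1: encode
        llm_state = matvec(self.in_proj, features)        # align to LLM space
        # Stage 2 (LLM reasoning) is elided in this sketch.
        signal = matvec(self.out_proj[target_modality], llm_state)
        return signal                                     # stage 3: condition a decoder

pipeline = ToyNextGptPipeline()
signal = pipeline.forward("a cat on a mat", target_modality="image")
print(len(signal))  # size of the conditioning vector for the image generator
```

The point of the sketch is the routing: one shared core in the middle, with a per-modality projection on the output side, which is what lets a single model answer a request in whichever modality the user asks for.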